Okay. So the last thing I want to do in machine learning is what's called statistical learning,
where we're essentially using Bayesian network techniques for learning. We've been in a sense
using Bayesian networks as models of the world which can be used to make predictions about the
world, probabilistic predictions about the world. The problem there, of course, is: where do these
networks come from? How do we learn them? No theory is really complete without giving ourselves a
way of learning those networks, and the answer, of course, is that we learn them from data. I want
to show you how that works, as the logical conclusion of all the probabilistic material we've been
covering. The idea of Bayesian learning is relatively simple: rather than making a hard, zero-one
commitment to a single best hypothesis, why not use probabilistic methods to learn a probability
distribution over the hypothesis space? Give yourself a little bit of softness there. Instead of
the hard weeding-out of hypotheses we've been doing in learning so far, we think of the hypothesis
as a random variable, treat the examples as observations of other, connected random variables,
and then see how we can use the methods we've developed before.
So we have a hypothesis variable H, which may or may not come with a known prior distribution
P(H); the prior is what you know before you have seen any examples. Then the observations d_j come
in, each the outcome of another random variable D_j, and the training data d is just this sequence
of observations. That sounds a lot like what we've seen before. So we can compute the posterior
probability of a hypothesis h_i given the data so far, using our usual Bayesian methods: a
normalization constant times the probability of the data given the hypothesis, times the prior
probability of the hypothesis itself. We've used Bayes' rule here to turn the conditional
probability around, and the prior is something we know. The term P(d | h_i) is what we call the
likelihood of the data under the hypothesis. And we can now predict the outcome of some quantity X
by summing out over the hypotheses; since X is independent of the data given the hypothesis, the
prediction becomes a weighted average of the individual hypotheses' predictions, weighted by their
posterior probabilities.
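The two formulas I was just describing, written in the standard notation for full Bayesian
learning (d is the data, h_i the hypotheses, and α the normalization constant), are:

\[
P(h_i \mid \mathbf{d}) = \alpha \, P(\mathbf{d} \mid h_i) \, P(h_i)
\]
\[
P(X \mid \mathbf{d}) = \sum_i P(X \mid \mathbf{d}, h_i) \, P(h_i \mid \mathbf{d})
                     = \sum_i P(X \mid h_i) \, P(h_i \mid \mathbf{d})
\]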
The nice thing about this is that we do not have to pick a single best hypothesis; we can make
predictions without committing to one. The hypotheses stay in the background, and we evolve the
distribution over them at the same time as the data comes in. Here's an example. There's a new
kind of sweets on the market, and since sweets are a commodity, people keep inventing things to
make them more attractive. Here the invention is an element of surprise: the company sells five
kinds of bags, each containing two kinds of candy, lime and cherry, and for the sake of argument
we prefer cherry over lime. The bags all look the same, and their contents also look the same,
because each candy is wrapped; only once you unwrap a candy can you see whether it's lime or
cherry. Of the five kinds of bags, one is all cherry, one is all lime, two are three-quarters
cherry or three-quarters lime, and one is a 50-50 mix. And there's a prior distribution over these
bag types: 10%, 20%, 40%, 20%, 10%, which adds up nicely to 100%.
That's the setup. We're interested in a couple of things. One: if I've unwrapped a couple of
candies and found them to be lime, what's the probability that the next candy is cherry, the one
we prefer? Two: given the candies we've unwrapped so far, what's the probability that the bag
we've bought, which carries no indication on the outside, belongs to each of the five types?
Those are the things we're interested in. And we make the usual assumption, namely that the
observations are IID, independent and identically distributed. We can ensure that by making the
bags very big, or by unwrapping each candy, rewrapping it, and putting it back into the bag, which
we'd rather not do. We prefer big bags of candy, even if that doesn't give us exact IID; it's
near enough.
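Under that IID assumption the likelihood of the data factorizes into per-candy terms. If I write
θ_i (just my shorthand here) for the lime fraction of bag type i, which for the five types above
is 0, 1/4, 1/2, 3/4, and 1, then observing N limes in a row gives:

\[
P(\mathbf{d} \mid h_i) = \prod_{j=1}^{N} P(d_j \mid h_i), \qquad
P(d_1 = \text{lime}, \ldots, d_N = \text{lime} \mid h_i) = \theta_i^{\,N}.
\]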
So what do we do? Let's take the prediction formula and the hypothesis posterior and run an
experiment: what happens if we only ever find limes? Here are the results. If we have zero candies
unwrapped, the posterior over the bag types is simply the prior distribution.
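Here is a minimal sketch in Python of how these numbers can be computed, assuming the lime
fractions and the prior from the setup above (with the 40% prior on the 50-50 bag, as in the usual
presentation of this example); the function names are just my own choices for illustration:

```python
# Full Bayesian learning for the candy example: five bag types (hypotheses),
# each with a fixed lime fraction, and a prior over the bag types.
lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]  # h1 = all cherry ... h5 = all lime
prior         = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h_i), sums to 1

def posterior_after_limes(n):
    """P(h_i | first n candies are lime) = alpha * theta_i^n * P(h_i)."""
    unnormalized = [(theta ** n) * p for theta, p in zip(lime_fraction, prior)]
    alpha = 1.0 / sum(unnormalized)
    return [alpha * u for u in unnormalized]

def prob_next_lime(n):
    """P(next candy is lime | first n limes) = sum_i theta_i * P(h_i | data)."""
    posterior = posterior_after_limes(n)
    return sum(theta * p for theta, p in zip(lime_fraction, posterior))

# Posterior over bag types and next-candy prediction after 0..10 limes in a row.
for n in range(11):
    post = [round(p, 3) for p in posterior_after_limes(n)]
    print(n, post, round(prob_next_lime(n), 3))
```

With zero limes the posterior equals the prior; as more and more limes are unwrapped, the all-lime
hypothesis takes over, and the predicted probability that the next candy is lime approaches 1 (so
the probability of the cherry we prefer goes to 0).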
The Candy Flavors Example to introduce Full Bayesian Learning and its properties.